Skip to main content

Step 1: Loading & Exploring Datasets

Study Area & H3 Resolution

We will use the entire United States as our study area.

This analysis will be done using H3 hexagons to join these datasets together. Each of these datasets originally come as rasters. We've ingested them to H3 hexagons, though we need to decide on a H3 resolution to perform our analysis on.

Here is a comparison of different H3 resolutions over the same study area:

Area of InterestH3 Hex ResolutionAver. H3 Cell Size# H3 Cells in Area
Continental United States*4~1,770 km²~10,000
Continental United States*5~252 km²~70,000
Continental United States*6~36 km²~49,000
  • United States being defined by the bounds: [-130, 25, -60, 50], i.e. approximately 17.6M km² (US is about 9.8M km²)

See the H3 Cell Counts Stats for more details

We'll decide to use a H3 resolution of 5 for this analysis.

H3 rule of thumb

Going up a resolution by 1 level reduces the area of each hexagon by a factor of 7. This means that at resolution 5, each hexagon is 7x7 = 49x smaller than at resolution 3.

Keep this in mind as more hexagons also require longer compute time.

Getting number of H3 cells for bounds and resolution

We can write a simple UDF to estimate the number of H3 cells for a given bounds and resolution.

@fused.udf
def udf(bounds: fused.types.Bounds = [-130, 25, -60, 50], res: int = 5):

# Load common library for H3 utilities
common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")

# Convert bounds to GeoDataFrame and calculate UTM area
gdf_bounds = common.bounds_to_gdf(bounds)
gdf_bounds = common.add_utm_area(gdf_bounds)
area_millions = round(gdf_bounds['utm_area_sqm'].iloc[0] / 1e12, 1)
print(f"Bounds area: {area_millions}M square km")

# Get H3 hexagons within bounds at specified resolution
df_hex = common.bounds_to_hex(bounds=bounds, res=res, hex_col="hex")
print(f"{df_hex.T}")

return df_hex

Returns:

>>> Bounds area: 17.6M square km
>>> 0 ... 67634
hex 599309241731252223 ... 600345878259040255

[1 rows x 67635 columns]

This result is 17.6M km², i.e. about 1.8 times more than the actual area of the United States (9.8M km²).

This is because the bounds is a rectangle around the United States, not the exact boundary of the United States:

Data Ingestion: Raster to Hex

There are a few options for ingesting raster data to H3 hexagons

MethodProcessing MethodUsabilityHow To
Single FileArea of Interest in 1 UDFRecompute required for every new area / resolutionSee Example UDF: DEM_Tile_Hexify (Set to Single UDF Mode)
Dynamic TileArea of Interest in parallel UDFsRecompute required for every new area / resolutionSee H3 Tiling Docs Section
Full IngestionOnly onceRead onlyReach out to Fused (info@fused.io)

The rest of this example assumes datasets are already in some hex format.

Dataset 1: Crops

Data

For simplicity we already created a hex dataset in H3 cell size 5 for 2024: s3://fused-asset/misc/hex/CDL_h12k1p1/year=2024/overview/hex5.parquet

Reading data in UDF

  • Load the hex dataset using DuckDB from a .parquet on S3
# cdl_data_all_crops
@fused.udf
def udf():
import duckdb

df = duckdb.query(f"""
SELECT *
FROM 's3://fused-asset/misc/hex/CDL_h12k1p1/year=2024/overview/hex5.parquet'
and area>1000
""").to_df()

df = df.rename(columns={'data': 'crop_type'})

return df

Returns:

>>>      0             1       ...        706262        706263
crop_type 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00
hex 5.992301e+17 5.992302e+17 ... 6.003962e+17 6.003962e+17
area 6.128944e+07 2.849653e+08 ... 2.162687e+07 2.917735e+08
chunk_id 5.767422e+17 5.767422e+17 ... 5.779033e+17 5.779033e+17
year 2.024000e+03 2.024000e+03 ... 2.024000e+03 2.024000e+03

This UDF:

  • Loads all the data as hex cells
  • Only keeps cells with an area greater than 1000 square kilometers (filters out smallest crops types)
  • Returns 1 row per crop type per cell (so each H3 cell can have multiple rows)

Visualizing a single crop type

Fused UDFs are serverless Python functions, so we can also create visualizations with them. We've built a simple UDF library to render data on maps.

We can create a new UDF that loads the data and filters for a specific crop type (you can see the CDL Crop codes here)

Code: Map UDF for a single crop
@fused.udf
def udf(
crop_type=1 # Crop 1 = Corn
):
utils = fused.load('team/map_utils')

data = fused.run("cdl_data_all_crops")
# Keeping only the specific crop_type selected
data = data[data['crop_type'] == crop_type]

# keeping subset of cols so tooltip only shows these
data = data[['crop_type', 'hex', 'area']]

print(data)
config = {
"initialViewState": {
"longitude": -95.7129,
"latitude": 37.0902,
"zoom": 3
},
"hexLayer": {
"@@type": "H3HexagonLayer",
"filled": True,
"pickable": True,
"extruded": False,
"getHexagon": "@@=properties.hex",
"getFillColor": {
"@@function": "colorContinuous",
"attr": "area",
"domain": [0, 30_000_000],
"steps": 20,
"colors": "Peach"
}
}
}
return utils.deckgl_hex(
data,
config=config
)

This returns a standalone UDF:

Visualizing multiple crop types

In Fused Canvas we can create different visualization of the same UDF simply by filtering for a specific crop type. We'll create:

  1. Our original cdl_data_all_crops UDF loading all crops types
  2. A UDF returning a map of crop_type = 1 (Corn)
  3. A UDF returning a map of crop_type = 3 (Rice)
  4. A UDF returning a map of crop_type = 111 (Open Water)

Explore this Fused Canvas for yourself:

  • Link to live canvas
  • Click the "Download Canvas" button to download the canvas .zip file and upload it directly in your Workbench

Dataset 2: Elevation

Data

  • Original Raster tiles from Terrain Tiles
  • Hex partitioned dataset by Fused: s3://fused-asset/data/dem/r6_z4_100k.parquet (See on File Explorer)

Reading Data

The .parquet file we'll use isn't at hex 5 resolution, so we will dynamically:

  • Use h3_cell_to_parent(hex, 5) as hex, to convert the hex to the parent hex at resolution 5
  • Use avg(mean_value) as mean_value to aggregate by average elevation of all the elevation values covering the parent hex
  • Keep only values within the bounds viewport with h3_cell_to_lng(hex)
@fused.udf
def udf(
bounds:fused.types.Bounds=[-130, 25, -60, 50]
):
# Loading the common Fused hex library to access pre-installed H3 library in DuckDB
common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
con = common.duckdb_connect()

query = """
SELECT
h3_cell_to_parent(hex, 5) as hex,
avg(mean_value) as mean_value
FROM read_parquet('s3://fused-asset/data/dem/r6_z4_100k.parquet')
where 1=1
and h3_cell_to_lng(hex) between ? and ?
and h3_cell_to_lat(hex) between ? and ?
group by 1
"""
data = con.execute(query, [bounds[0], bounds[2], bounds[1], bounds[3]]).df()

return data

Returns:

                   0             1      ...         68042         68043
hex 5.992300e+17 5.992301e+17 ... 5.996693e+17 5.996693e+17
mean_value 3.876500e+02 2.390833e+02 ... 1.530952e+02 1.819881e+02

Visualizing Elevation

We can visualize the elevation data similarly to the crop data:

Show code details
@fused.udf
def udf():
utils = fused.load('team/map_utils')
data = fused.run("elevation_hex_mean_height")
print(data.T)

config = {
"initialViewState": {
"longitude": -95.7129,
"latitude": 37.0902,
"zoom": 3
},
"hexLayer": {
"@@type": "H3HexagonLayer",
"filled": True,
"pickable": True,
"extruded": False,
"getHexagon": "@@=properties.hex",
"getFillColor": {
"@@function": "colorContinuous",
"attr": "mean_value",
"domain": [0, 3000], # Rough elevation range in meters in the US
"steps": 20,
"colors": "Earth"
}
}
}
return utils.deckgl_hex(
data,
config=config
)

Joining the two together on a Fused Canvas:

Explore the Fused Canvas:

  • Link to live canvas
  • Click the "Download Canvas" button to download the canvas .zip file and upload it directly in your Workbench

Dataset 3: Temperature

Data

Reading Data

We'll create a UDF that takes:

  • A given year & month as YYYY-MM
  • Bounds list

And returns:

  • Average temperature for any given hexagon over that month

We can load the data using DuckDB and visualize it using the deckgl_hex function:

@fused.udf
def udf(month='2024-05',
bounds:fused.types.Bounds=[-130, 25, -60, 50]
):
path = f's3://fused-asset/data/era5/t2m_daily_mean_v2_1000/month={month}/0.parquet'

common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
con = common.duckdb_connect()

query = """
SELECT
hex,
avg(daily_mean) - 273.15 as daily_mean
FROM read_parquet(?)
WHERE 1=1
AND h3_cell_to_lng(hex) between ? and ?
AND h3_cell_to_lat(hex) between ? and ?
GROUP BY 1
"""

data = con.execute(
query,
[path, bounds[0], bounds[2], bounds[1], bounds[3]]
).df()

print(f"{data.T=}")

return data

Note a few things here:

  1. The dataset is not in hex 5, this is a different resolution than the other datasets we've been working with.
  2. The data has a lot of "empty" areas.

This is because the underlying data, the ERA5 dataset is a much coarser spatial resolution than the other 2 datasets (around ~31km at the equator, compared to 30m for the Crop Data Layer and 5m for the Elevation data)

We can smooth the data to the hex 5 resolution using a k-ring neighborhood aggregation:

@fused.udf
def udf(month='2024-05',
bounds:fused.types.Bounds=[-130, 25, -60, 50]
):
common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
import pandas as pd
path = f's3://fused-asset/data/era5/t2m_daily_mean_v2_1000/month={month}/0.parquet'

con = common.duckdb_connect()

query = """
WITH base_data AS (
SELECT
h3_cell_to_parent(hex, 5) as hex,
avg(daily_mean) - 273.15 as daily_mean
FROM read_parquet(?)
WHERE 1=1
AND h3_cell_to_lng(hex) between ? and ?
AND h3_cell_to_lat(hex) between ? and ?
GROUP BY 1
),
all_hexes AS (
-- Get all hexes including neighbors by expanding k-ring
SELECT DISTINCT unnest(h3_grid_disk(hex, 1)) as hex
FROM base_data
),
smoothed AS (
-- For each hex (original + neighbors), get its k-ring neighbors
SELECT
ah.hex as hex,
unnest(h3_grid_disk(ah.hex, 1)) as neighbor_hex
FROM all_hexes ah
)
SELECT
s.hex,
round(avg(b.daily_mean), 2) as daily_mean
FROM smoothed s
JOIN base_data b ON b.hex = s.neighbor_hex
GROUP BY s.hex
"""

data = con.execute(
query,
[path, bounds[0], bounds[2], bounds[1], bounds[3]]
).df()

print(f"{data.T=}")

return data

Visualizing a single month

As above we can visualize the temperature data across a single month:

Visualizing multiple months

We can create a Fused Canvas to visualize data of different months at once. We create a 'recap' Canvas to show, from left to right:

  1. Original dataset at original hex resolution
  2. Convert to parent hex at resolution 5
  3. Smooth the data to the hex 5 resolution -> Visualize 3 different months at once

Explore the Fused Canvas:

  • Link to live canvas
  • Click the "Download Canvas" button to download the canvas .zip file and upload it directly in your Workbench

Summary & Next Steps

We now have each dataset at the same hex5 resolution over the same bounds of the United States. Next: